June 20, 2018

Unsupervised Learning

  • No labeled response variable
    • identify hidden structure or intrinsic patterns
    • clustering rather than classification
    • model validity is hard to check (no labels to compare against)
  • Models include
    • Association Rules
    • Cluster Analysis
    • Self-Organizing Maps
    • Principal Components Analysis (PCA)
    • Multidimensional Scaling

Association Rules

  • Identify associations between variables based on how frequently their values occur together
    • eggs and milk are often sold together, so the egg–milk pair may have a high frequency in a store's transactions
    • find the values of \(X = (X_1, X_2,..., X_p)\) that appear most frequently in the database
  • Model: maximize \[Pr [\cap_{j=1}^p (X_j \in s_j)]\]
    • \(s_j \subseteq S_j\), where \(S_j\) is the set of possible values (the support) of \(X_j\)
  • Mostly applied to binary-valued data \(X_j \in \{0, 1\}\)
    • referred to as market basket analysis

Association Rules

  • the combinations of \(X_1\) and \(X_2\) that occur most frequently are marked red
  • the picture demonstrates that we should use a set of values \(s_j\), not a single value of \(X_j\)
  • an association rule has the form itemset \(A \Rightarrow\) itemset \(B\)

Terms Related to Association Rules
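For a rule itemset \(A \Rightarrow\) itemset \(B\), the standard measures are the same support, confidence, and lift quantities reported by apriori below (standard definitions, stated here for completeness):

  • Support: the fraction of transactions containing both itemsets, \[T(A \Rightarrow B) = \Pr(A \text{ and } B)\]
  • Confidence: an estimate of \(\Pr(B \mid A)\), \[C(A \Rightarrow B) = \frac{T(A \Rightarrow B)}{T(A)}\]
  • Lift: confidence relative to the baseline frequency of \(B\), \[L(A \Rightarrow B) = \frac{C(A \Rightarrow B)}{T(B)}\]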

Case Study: Association Rules

library(arules)
data("Groceries")
inspect(head(Groceries, 2))
##     items                
## [1] {citrus fruit,       
##      semi-finished bread,
##      margarine,          
##      ready soups}        
## [2] {tropical fruit,     
##      yogurt,             
##      coffee}
  • Notice that the Groceries data set is not a data frame; it is of class transactions. To convert your own data frame to transaction format, use
as(yourDataFrame, "transactions")
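A minimal sketch of that conversion, using a made-up two-item data frame (the column names are hypothetical); the coercion requires the columns to be factors or logicals:

```r
library(arules)

# toy purchase data (hypothetical); columns must be factors (or logicals)
df <- data.frame(milk = factor(c("yes", "no", "yes")),
                 eggs = factor(c("yes", "yes", "no")))
trans <- as(df, "transactions")
summary(trans)  # 3 transactions over items such as "milk=yes"
```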

Item Frequency

# frequentItems <- eclat (Groceries, parameter = list(supp = 0.08, maxlen = 6))
# inspect(frequentItems)
itemFrequencyPlot(Groceries, topN=10, type="absolute")

Getting the Rules

rules <- apriori (Groceries, parameter = list(supp = 0.001, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 9 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [5668 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Sorting the Rules

rules_conf <- sort (rules, by="confidence", decreasing=TRUE) 
inspect(head(rules_conf))
##     lhs                     rhs                    support confidence     lift count
## [1] {rice,                                                                          
##      sugar}              => {whole milk}       0.001220132          1 3.913649    12
## [2] {canned fish,                                                                   
##      hygiene articles}   => {whole milk}       0.001118454          1 3.913649    11
## [3] {root vegetables,                                                               
##      butter,                                                                        
##      rice}               => {whole milk}       0.001016777          1 3.913649    10
## [4] {root vegetables,                                                               
##      whipped/sour cream,                                                            
##      flour}              => {whole milk}       0.001728521          1 3.913649    17
## [5] {butter,                                                                        
##      soft cheese,                                                                   
##      domestic eggs}      => {whole milk}       0.001016777          1 3.913649    10
## [6] {citrus fruit,                                                                  
##      root vegetables,                                                               
##      soft cheese}        => {other vegetables} 0.001016777          1 5.168156    10

Homework Problem

  • Obtain the marketing data from the R package ElemStatLearn and find the 3 association rules shown on page 494 of the textbook.
library(ElemStatLearn)
data("marketing")

Cluster Analysis

  • Cluster analysis involves
    • grouping objects based on the similarity of their variable values
    • objects in a cluster are more closely related to each other than to objects in other clusters
  • Some of the popular clustering methods include
    • \(K\)-means clustering
    • Hierarchical Clustering

\(K\)-means Clustering

  • All variables need to be quantitative
  • The dissimilarity between each pair of points is the squared Euclidean distance \[d(x_i, x_{i'})= \sum_{j=1}^p(x_{ij}-x_{i'j})^2\]
  • Within-cluster variation for the \(k\)th cluster, with cluster mean \(\mu_k\): \[W(C_k) = \sum_{x_i \in C_k}\sum_{j=1}^p(x_{ij}-\mu_{kj})^2\]
  • The goal is to minimize the total within-cluster variation \[\sum_{k=1}^K W(C_k)\]
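The centroid and pairwise-distance views are equivalent; a standard identity (cf. ESL Chapter 14) gives \[\frac{1}{2}\sum_{x_i \in C_k}\,\sum_{x_{i'} \in C_k} d(x_i, x_{i'}) = N_k \sum_{x_i \in C_k} \|x_i - \mu_k\|^2,\] where \(N_k\) is the number of observations in \(C_k\) and \(\mu_k\) their mean. So minimizing the total within-cluster variation also minimizes the average pairwise squared distance within each cluster.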

\(K\)-means Clustering

The algorithm is as follows

  • Iteration 1
    • randomly assign each observation to a cluster
    • step \(a\)
      • compute each cluster mean
      • compute the distance from each observation to each mean
    • step \(b\)
      • assign each observation to the cluster with the nearest mean
  • Iteration \(i\)
    • repeat steps \(a\) and \(b\)
    • stop when the assignments no longer change, i.e. the total variation \(\sum W(C_k)\) has reached a (local) minimum
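The steps above can be sketched directly in R. This is an illustrative implementation (the function name my_kmeans is made up, and it ignores edge cases such as empty clusters); use the built-in kmeans() in practice:

```r
# Illustrative K-means following the steps above; not robust to empty clusters
my_kmeans <- function(x, k, max_iter = 100) {
  cluster <- sample(k, nrow(x), replace = TRUE)  # Iteration 1: random assignment
  for (iter in seq_len(max_iter)) {
    # step a: compute each cluster mean (one column per cluster), then the
    # squared Euclidean distance from every point to every mean
    centers <- sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE]))
    d2 <- sapply(seq_len(k), function(j) colSums((t(x) - centers[, j])^2))
    # step b: move each point to its nearest cluster mean
    new_cluster <- max.col(-d2, ties.method = "first")
    if (all(new_cluster == cluster)) break  # assignments stable: stop
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = t(centers))
}
```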

Example: \(K\)-means Clustering

set.seed(3256)
dat <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1.5, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))
colnames(dat) <- c("x", "y")
plot(dat)

Example: \(K\)-means Clustering

cl <- kmeans(dat, 3)
plot(dat, col = cl$cluster)
points(cl$centers, col = 1:3, pch = 17, cex = 1.5)

How many Clusters?

  • Elbow method to determine the number of clusters for \(K\)-means clustering: fit for a range of \(K\), plot the total within-cluster variation, and pick the \(K\) at the bend (elbow) of the curve
K <- 1:6
wss <- sapply(K, function(k){kmeans(dat, k)$tot.withinss})
plot(K, wss, type="b", pch = 19)

Hierarchical Clustering

  • Creates clusters based on distances
    • the distance between two objects can be based on
      • Euclidean distance
    • the distance between two clusters (the linkage) can be based on
      • the average of the pairwise distances between the clusters (average linkage)
      • the maximum of the pairwise distances between the clusters (complete linkage)
  • The results are represented by a tree diagram called a dendrogram
    • it helps choose the number of clusters: cut the tree horizontally where the groups are clearly distinguished.

Example: Hierarchical Clustering

  • It does not require choosing the number of clusters before fitting the model
hc <- hclust(dist(dat), method="average")
plot(hc)
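Continuing with hc and dat from above, the dendrogram can then be cut into a chosen number of groups (3 here is a hypothetical choice that matches how dat was simulated):

```r
# cut the tree into 3 groups and color the points by group membership
grp <- cutree(hc, k = 3)
table(grp)
plot(dat, col = grp)
```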

Self-Organizing Maps (SOM)

  • Also uses only numerical data
    • categorical data should be converted into numerical variables (e.g., dummy variables)
    • reduces high-dimensional data to a two-dimensional map
    • the map can be colored by any variable to show how that variable helps form the clusters

Example: Self-Organizing Maps (SOM)

library(kohonen)
data(wines)
dat_matrix <- as.matrix(scale(wines))
som_grid <- somgrid(xdim = 6, ydim=6, topo="hexagonal")
som_model <- som(dat_matrix, grid=som_grid)
plot(som_model, type="changes")

SOM Grid Counts

  • Grey color indicates empty nodes; many empty nodes suggest we should decrease the grid size.
plot(som_model, type="count")

SOM Neighborhood Distance

  • A larger distance to neighboring nodes indicates a different cluster, so regions of large distance should show the cluster boundaries.
  • a smaller distance means neighboring nodes belong to similar clusters
plot(som_model, type="dist.neighbours")

SOM Weight Vectors

  • This should roughly show the cluster boundaries
plot(som_model, type="codes")

SOM Heatmap

myDat <- som_model$codes[[1]]
plot(som_model, type = "property", property = myDat[,1], main=colnames(myDat)[1])

SOM clustering

wss <- (nrow(myDat)-1)*sum(apply(myDat,2,var)) 
K <- 2:15
for (i in K) wss[i] <- sum(kmeans(myDat, centers=i)$withinss)
plot(K, wss[K], type="b", pch = 19)

Plotting SOM clusters

som_cluster <- cutree(hclust(dist(myDat)), 8)
# plot these results:
plot(som_model, type="mapping", bgcol = som_cluster, main = "Clusters") 
add.cluster.boundaries(som_model, som_cluster)

Principal Components Analysis (PCA)

library(ISLR)
nci.labs <- NCI60$labs
nci.data <- NCI60$data
pr.out <- prcomp(nci.data, scale=TRUE)
plot(pr.out)
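The scree plot above can be quantified by computing the proportion of variance explained (PVE) from the component standard deviations; this is a standard calculation added here for illustration:

```r
# proportion of variance explained by each principal component
pve <- pr.out$sdev^2 / sum(pr.out$sdev^2)
plot(cumsum(pve), type = "b", pch = 19,
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained")
```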

PCA Score Plots

Cols=function(vec){
  cols=rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}
par(mfrow=c(1,2))
plot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19, xlab="Z1",ylab="Z2")
plot(pr.out$x[,c(1,3)], col=Cols(nci.labs), pch=19, xlab="Z1",ylab="Z3")

Multidimensional Scaling (MDS)

  • Has a similar goal to PCA and SOM: reduce the data to a lower dimension
    • but takes a different approach than PCA and SOM
    • unlike PCA and SOM, we don't need the complete data (\(x_i\)); we only need the pairwise distances \(d_{ii'}\)
  • The lower-dimensional points (\(z_1, z_2, ..., z_N\)) are obtained by minimizing the stress function \[S_M(z_1, z_2, ..., z_N) = \sum_{i \ne i'}(d_{ii'}-||z_i-z_{i'}||)^2\]
  • Once MDS is done, we can obtain clusters by applying a clustering method (say, \(k\)-means) to the new data set \(Z=(z_1, z_2, ..., z_N)\)
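Note that cmdscale(), used in the example below, performs classical (metric) MDS via an eigendecomposition rather than minimizing the stress above. A solution closer to the least-squares stress criterion can be obtained with MASS::sammon(), which iteratively minimizes a weighted stress (a sketch, using the built-in swiss data):

```r
library(MASS)   # for sammon()

d <- dist(swiss)                         # pairwise distances are all MDS needs
fit <- sammon(d, k = 2, trace = FALSE)   # minimizes a weighted stress iteratively
plot(fit$points, type = "n")
text(fit$points, labels = rownames(swiss), cex = 0.7)
```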

Example: Multidimensional Scaling (MDS)

data(swiss)
mds <- as.data.frame(cmdscale(dist(swiss)))
colnames(mds) <- c("mds1", "mds2")
library(ggpubr)
ggscatter(mds, x= "mds1", y = "mds2", label = rownames(mds))

Clustering using Multidimensional Scaling (MDS)

library(dplyr)  # for %>% and mutate()
clust <- kmeans(mds, 3)$cluster %>% as.factor()
mds <- mds %>% mutate(groups = clust)
ggscatter(mds, x = "mds1", y = "mds2", label = rownames(swiss),color = "groups",
          ellipse = TRUE, ellipse.type = "convex")

Homework Problem

  • We want to create clusters of similar states based on the USArrests data.
    • How many clusters do you recommend?
    • Use hierarchical clustering, PCA, SOM, and MDS, and compare the results
    • Create a USA map and color the states based on your clusters
      • Are the states in the same cluster geographically close together?
data("USArrests")
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

Homework Problem Hint

d <- dist(USArrests)
dc <- hclust(d, method="average")
plot(dc)

Reading and References

  • Chapter 14, The Elements of Statistical Learning by Hastie et al.
    • to obtain the data used in this book, install the R package ElemStatLearn
  • An Introduction to Statistical Learning with Applications in R by James et al.